Deep learning application of vertebral compression fracture detection using mask R-CNN

Vertebral compression fractures (VCFs) of the thoracolumbar spine are commonly caused by osteoporosis or result from traumatic events. Early diagnosis of VCFs can prevent further damage to patients. When assessing these fractures, plain radiographs are used as the primary diagnostic modality. In this study, we developed a deep learning-based fracture detection model that could be used as a tool for primary care in the orthopedic department. We constructed a VCF dataset using 487 lateral radiographs, which included 598 fractures in the T11-L4 vertebrae. To detect VCFs, a Mask R-CNN model was trained and optimized, and was compared with three other popular instance segmentation models: Cascade Mask R-CNN, YOLOACT, and YOLOv5. With Mask R-CNN we achieved the highest mean average precision score of 0.58 and were able to locate each fracture pixel-wise. In addition, the model showed high overall sensitivity, specificity, and accuracy, indicating that it detected fractures accurately with few misdiagnoses. Our model can be a potential tool for detecting VCFs from a simple radiograph and assisting doctors in making appropriate decisions in initial diagnosis.


Study settings
In this study, approximately 70% (346 radiographs) of the dataset was used to train the neural network, and approximately 15% each was allocated to validation (71 radiographs) and test data (70 radiographs). The training, validation, and test data were split in a stratified manner to preserve the class-wise distribution. Radiographs with no fractures were used only in the test phase. We used stochastic gradient descent with momentum as the optimization method. The learning rate was set to decrease during training, with a weight decay of 0.0001 and a momentum of 0.9. We used transfer learning 17 to enhance model performance. Each model was trained starting from weights pretrained on the COCO instance segmentation dataset 18. Horizontal flipping and random rotation of 10 degrees were applied as augmentation. The summary of our VCF dataset is listed in Table 1.
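As a rough illustration of the optimizer described above, the following minimal sketch applies one step of stochastic gradient descent with momentum 0.9 and weight decay 0.0001. The function name `sgd_momentum_step`, the learning rate value, and the toy objective are hypothetical; the update convention (weight decay added to the gradient before the momentum buffer is updated) is the one commonly used in deep learning frameworks and is an assumption about this study's implementation.

```python
def sgd_momentum_step(params, grads, velocity, lr, momentum=0.9, weight_decay=1e-4):
    """One SGD update with momentum and L2 weight decay (lists are updated in place)."""
    for i, (p, g) in enumerate(zip(params, grads)):
        g = g + weight_decay * p                   # weight decay acts as an L2 penalty gradient
        velocity[i] = momentum * velocity[i] + g   # accumulate gradients in the momentum buffer
        params[i] = p - lr * velocity[i]           # step against the accumulated direction
    return params, velocity


# Toy usage (hypothetical): minimise f(w) = w^2, whose gradient is 2w.
params, velocity = [1.0], [0.0]
for _ in range(300):
    grads = [2.0 * params[0]]
    sgd_momentum_step(params, grads, velocity, lr=0.02)
```

In a real training run the learning rate would additionally follow a decreasing schedule, which is omitted here because the schedule is not specified in the text.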

Mask R-CNN
Mask R-CNN is an instance segmentation model based on the Faster R-CNN model 19. Mask R-CNN 20 introduced the segmentation branch, which is composed of four convolutions, one deconvolution, and one convolution to process instance segmentation. Moreover, ROI Align was introduced to fix the information loss of ROI pooling caused by the misalignment of feature maps and ROIs (Regions of Interest), and it significantly improved the segmentation accuracy. The backbone of Mask R-CNN consists of ResNet 21 and a Feature Pyramid Network (FPN) 22. The backbone used residual learning to precisely extract object features, and a feature pyramid to fuse multi-scale features to construct high-quality feature maps. Subsequently, ROIs were extracted from the feature maps using a region proposal network (RPN). The ROIs were then aligned and pooled by ROI Align. After the pooling layer, the model predicted segmentation masks using a fully convolutional network. The structure of Mask R-CNN is shown in Fig. 2. Mask R-CNN has several applications in instance segmentation. Mask R-CNN incorporated the structure of previous R-CNN models and improved on them to make a faster, more accurate, and more effective instance segmentation model.
www.nature.com/scientificreports/
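To make the segmentation branch (four convolutions, one deconvolution, one convolution) more concrete, the sketch below traces the tensor shape through such a head. The specific sizes (14 x 14 ROI features, 256 channels, 3 x 3 kernels with padding 1, a 2 x 2 stride-2 deconvolution) follow the FPN head described in the original Mask R-CNN paper and are assumptions here, as is the choice of six output classes; the function names are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a standard convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel, stride):
    """Spatial output size of a transposed convolution without padding."""
    return (size - 1) * stride + kernel


def mask_head_shapes(roi_size=14, channels=256, num_classes=6):
    """Trace (channels, height, width) through four convs, one deconv, one conv."""
    shapes = [(channels, roi_size, roi_size)]
    size = roi_size
    for _ in range(4):                            # four 3x3 convolutions keep the resolution
        size = conv_out(size, kernel=3, padding=1)
        shapes.append((channels, size, size))
    size = deconv_out(size, kernel=2, stride=2)   # the deconvolution doubles the resolution
    shapes.append((channels, size, size))
    size = conv_out(size, kernel=1)               # 1x1 convolution: one mask channel per class
    shapes.append((num_classes, size, size))
    return shapes
```

Under these assumptions each ROI ends with one 28 x 28 mask per class, which is later resized to the ROI's size in the image.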

Backbone network
The backbone network was used to extract features from the input radiograph. We implemented ResNet101 with FPN as the backbone network to extract reliable feature maps. In the bottom-up pathway, the shallow ResNet layers extract low-level features such as corners and edges of the object, while the deeper layers extract high-level features such as texture and color. Then, in the top-down pathway, FPN was used to merge feature maps of different scales to better represent objects. The feature maps were used in the RPN and ROI Align to generate candidate region proposals for detection. The structure of the backbone network is shown in Fig. 3.
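The top-down pathway can be sketched on toy data as follows. In the original FPN design, each coarser map is upsampled by a factor of two and merged with the next finer map by element-wise addition after a 1 x 1 lateral convolution; this sketch follows that convention as an assumption about the implementation used here, works on single-channel maps, and replaces the lateral convolution with the identity. All function names are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2-D feature map (list of rows)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def top_down_merge(coarse, lateral):
    """Merge a coarse map into the next finer level by element-wise addition."""
    up = upsample2x(coarse)
    return [[u + l for u, l in zip(up_row, lat_row)]
            for up_row, lat_row in zip(up, lateral)]


def build_pyramid(bottom_up):
    """bottom_up: maps ordered fine -> coarse, each half the size of the previous one."""
    pyramid = [bottom_up[-1]]                      # the coarsest level starts the pathway
    for lateral in reversed(bottom_up[:-1]):
        pyramid.insert(0, top_down_merge(pyramid[0], lateral))
    return pyramid
```

Each output level therefore combines the spatial detail of its own resolution with the semantic content propagated down from all coarser levels.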

Region proposal network and ROI align
The RPN generates ROIs using the feature map inputs from the backbone network. A 3 x 3 convolutional layer was used to scan the feature maps with a sliding window to generate anchors for bounding boxes of different scales. Binary classification was performed to determine whether each anchor contained an object or represented background. The structure of the RPN is shown in Fig. 3. The bounding box regression generated samples, and the intersection over union (IoU) value was calculated for each. A sample with an IoU higher than 0.7 was defined as a positive sample; otherwise, it was defined as a negative sample. Non-maximum suppression (NMS) was applied to keep the regions with the highest confidence scores. The feature maps from the backbone network and the ROIs from the RPN were passed to ROI Align for pooling. ROI Align was performed in the next stage to obtain fixed-size feature vectors and feature maps. ROI Align is a method proposed to avoid the misalignment issues of the ROI pooling layer, which rounds the locations of the ROIs on the feature map. A bilinear interpolation operation was performed on the sample points in each grid cell before pooling.
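The IoU-based sample labeling and the NMS step described above can be illustrated with a minimal sketch. Boxes are assumed to be given as (x1, y1, x2, y2) corner coordinates; the 0.7 positive threshold is taken from the text, while the NMS overlap threshold of 0.5 and all function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_sample(box, gt_boxes, pos_thresh=0.7):
    """Return 1 (positive) if the best IoU with any ground truth exceeds the threshold, else 0."""
    best = max((iou(box, gt) for gt in gt_boxes), default=0.0)
    return 1 if best > pos_thresh else 0


def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap a kept box too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep
```

The function `nms` returns the indices of the retained boxes in descending order of confidence, so near-duplicate proposals for the same vertebra collapse to the single most confident one.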

Mask prediction
The feature vector output from the previous stage was used to calculate the class probabilities of each ROI for classification, and to train bounding box regressors that refine the location and size of the bounding box so that it accurately encloses each object. The mask branch predicted a binary mask for each class of each ROI using a fully convolutional network (FCN).
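Because the mask branch outputs one mask per class, the final mask of an ROI is usually taken from the channel of the class chosen by the classification branch. The sketch below shows this selection on toy data; the function name `select_mask` and the binarization threshold of 0.5 (the value used in the original Mask R-CNN paper) are assumptions rather than details stated in this study.

```python
def select_mask(class_probs, class_masks, threshold=0.5):
    """Pick the mask of the most probable class and binarise it.

    class_probs: list of class probabilities for one ROI.
    class_masks: list of 2-D soft masks (values in [0, 1]), one per class.
    """
    best = max(range(len(class_probs)), key=lambda k: class_probs[k])
    soft = class_masks[best]
    binary = [[1 if p >= threshold else 0 for p in row] for row in soft]
    return best, binary


# Toy usage (hypothetical values): three classes, 2x2 soft masks.
probs = [0.1, 0.7, 0.2]
masks = [
    [[0.9, 0.9], [0.9, 0.9]],
    [[0.9, 0.2], [0.6, 0.4]],
    [[0.1, 0.1], [0.1, 0.1]],
]
predicted_class, predicted_mask = select_mask(probs, masks)
```

Decoupling the mask from the class decision in this way means the masks of different classes do not compete with each other at the pixel level.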

Evaluation metrics
True and false positives were defined by the value of the IoU. IoU was calculated by dividing the overlap between the predicted area and the ground truth area by their union. If the IoU of the predicted and actual regions exceeded a certain threshold, the detector's prediction was determined to be correct and was defined as a True Positive (TP). Conversely, if the IoU value was less than the threshold, the result was defined as a False Positive (FP). When the detector failed to predict a fracture that was present, it was defined as a False Negative (FN). Specificity was calculated using the dataset containing no fractures. We defined a True Negative (TN) as the case in which the detector did not predict any fractures from a normal radiograph, and a false detection on a normal radiograph as a False Positive (FP). We calculated sensitivity, specificity, accuracy, and F1-score from the resulting confusion matrix. Sensitivity is calculated with Eq. (1), specificity with Eq. (2), accuracy with Eq. (3), and F1-score with Eq. (4). The cumulative value was determined by listing the detected regions in order of confidence score. As the regions were listed, we calculated a precision-recall curve with the accumulated values and computed the AP from the area under the curve. Mean average precision (mAP) was calculated as the average of the AP scores of each class and was used as the overall evaluation metric across the DL models. AP scores were computed with Eq. (5).
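The metrics referred to above can be sketched with their standard definitions: sensitivity = TP / (TP + FN), specificity = TN / (TN + FP), accuracy = (TP + TN) / (TP + FP + TN + FN), and F1 = 2 x precision x sensitivity / (precision + sensitivity). For AP, the sketch uses a simple non-interpolated area under the precision-recall curve; the exact form used in this study is not reproduced in the text, so that choice is an assumption, as are all function names and the toy numbers.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, accuracy, and F1 from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "f1": f1}


def average_precision(detections, num_gt):
    """Area under the precision-recall curve for one class.

    detections: list of (confidence, is_true_positive) pairs.
    num_gt: number of ground-truth fractures of this class.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    ap, prev_recall = 0.0, 0.0
    for _, is_tp in ranked:                      # accumulate in order of confidence
        if is_tp:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_gt
        ap += precision * (recall - prev_recall)  # rectangle under the curve
        prev_recall = recall
    return ap


def mean_average_precision(ap_per_class):
    """mAP as the plain average of the per-class AP scores."""
    return sum(ap_per_class.values()) / len(ap_per_class)
```

Here `is_true_positive` would be decided by comparing each detection's IoU with the matched ground truth against the chosen threshold, as described above; the sketch assumes all denominators are non-zero.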

Ethical approval and consent to participate
This study was conducted according to the Declaration of Helsinki. This study was approved by the Institutional Review Board of Korea University Ansan Hospital and was conducted in accordance with the approved study protocol (IRB No. 2022AS0198). Due to the retrospective nature of this study, informed consent was waived by the Korea University Ansan Hospital Institutional Review Board and Ethical Committee.

Results
We compared the outcomes of the Mask R-CNN model with those of the Cascade Mask R-CNN 23, YOLOACT 24, and YOLOv5 25 models in terms of diagnostic accuracy and detection performance. The diagnostic performance of the models is compared in Table 2. The Mask R-CNN model and YOLOv5-m showed the highest sensitivities among the four models, 79.8% and 78.7%, respectively. Models with high sensitivity are less likely to miss patients with fractures. The Mask R-CNN model and the Cascade Mask R-CNN model had the highest specificities, 89.4% and 90.0%, respectively. However, considering that the Cascade Mask R-CNN model had the lowest sensitivity, this model had a strong tendency to predict that there was no fracture. This indicates that the Mask R-CNN model was able to learn features of each vertebra without fracture through negative samples, which were determined as background by the RPN. The Mask R-CNN and YOLOv5-m models showed higher F1-scores (82.6 and 83.5, respectively) than the other models, which means they were more likely to correctly classify X-ray images that contained fractures. The Mask R-CNN model showed the highest accuracy of 85.7% and provided the best overall diagnostic performance among all models.
The detection performance of the models is compared in Table 3. The AP score was computed with the IoU threshold set to 0.5, and mAP was calculated as the class-wise average of the AP scores. The Mask R-CNN model and the YOLOv5-x model showed the highest mAP of 0.58. However, there were differences in detection performance by category, as there was a class imbalance in the collected dataset. For T12, L1, and L2 fractures, of which more than 100 fractures each were collected, all models except Cascade Mask R-CNN achieved an AP higher than 0.7 and showed strong detection performance. In contrast, AP scores for T11 and L4 fractures, for which a relatively small amount of data was collected, were low for every model. For the three well-represented classes, the Mask R-CNN model achieved AP scores of 0.78, 0.80, and 0.79. YOLOv5 variants with more parameters and larger model size showed higher AP scores, as the YOLOv5-x and YOLOv5-l models performed better than the YOLOv5-s and YOLOv5-m models.
Model predictions are shown in Fig. 4. Compared with the ground truth, all the models predicted fractures close to the actual area. In these examples, the Mask R-CNN model did not miss any fractures and performed better than the other models. Multiple fractures were detected as well. The YOLOACT and YOLOv5 models predicted larger fracture areas, in some cases as multiple fractures even when only one fracture existed.

Discussion
Many people are diagnosed with VCFs every day. The plain radiograph of the spinal region is the initial diagnostic method because it is quick and inexpensive compared with other diagnostic modalities 2. DL algorithms are being actively studied in the field of medical image diagnostics and are showing good performance 4. Such work can assist medical experts by reducing the time required for diagnosis and increasing accuracy, ultimately improving patient care. However, there is not much research on detecting VCFs with DL algorithms, for several reasons. Despite the advantages radiographs have in the diagnosis of VCFs, it is difficult to construct a large dataset of radiographs of VCFs, because radiographs are not used for a final diagnosis. Also, labeling ground truth on each radiograph is difficult and labor-intensive because MRI findings must be referenced. Thus, previous studies have either classified vertebral radiographs or used a two-step process that segments each vertebra and then classifies each bone.
From this perspective, this study shows interesting results. We created a VCF dataset labeled based on MRI findings and trained deep learning segmentation models to diagnose and segment each fracture area. In terms of diagnosis, the Mask R-CNN and YOLOv5-m models achieved a sensitivity of nearly 80% and an F1-score of over 80% (Table 2). For classes that were easier to collect, such as L1, L2, and T12, most segmentation models achieved an AP score of over 0.7, indicating that they were able to accurately locate fracture regions pixel-wise (Table 3). The results showed that these models were able to successfully distinguish VCFs of patients, highlighting their potential to assist physicians in real-life settings (Fig. 4). However, there were some limitations in our study. First, the overall performance of the models was not sufficiently high. The most likely reason for this is the small amount of data for certain classes, such as T11, L3, and L4 fractures. Second, we only targeted fractures of the thoracolumbar area from T11 to L4, which allows only a partial diagnosis of VCFs. Nevertheless, our study proposed and demonstrated a pipeline for directly locating and classifying fractures using a deep learning segmentation model, Mask R-CNN. In the future, we plan to collect more lateral radiographs and train a model that targets all areas of the vertebral body.
The aim of the study was to develop a model that could efficiently detect vertebral compression fractures of the thoracolumbar region in lateral radiographs. We constructed an MRI-based labeled VCF dataset. Subsequently, we trained and optimized the DL-based instance segmentation model Mask R-CNN. We compared its detection and diagnosis performance with that of other popular segmentation models and showed that Mask R-CNN was the most appropriate model for detecting VCFs. A deep learning-based VCF detection model can reduce the time and cost spent on manual inspection of radiographs. If this model is used to assist doctors with the initial diagnosis of VCFs, faster diagnosis is possible and patients' treatment periods can be shortened.

Figure 1. Example of labeled data. Each fracture is labeled with a polygon on the thoracolumbar radiograph based on the MRI results. Each polygon mask contains the x and y coordinates of the polygon surrounding the fracture. Each bounding box consists of the upper-left x and y coordinates, width, and height. The entire labeling process was conducted by two trained orthopedic experts.

Figure 3. Backbone network and region proposal network. (a) Backbone network. Feature maps from ResNet are upsampled and resized with a 1 x 1 convolution to be merged with feature maps of different scales. (b) The region proposal network generates candidate regions for objects by sliding a window over the feature maps; these candidate regions are referred to as anchor boxes. Both classification and bounding box adjustment are performed for each anchor box.

Figure 4. Examples of prediction results of each model. From left to right, the actual lateral spine radiograph, the ground truth (expert-labeled fracture masks), and the prediction from each model are shown.

Table 1. Summary of VCF dataset.

Table 2. Comparison of detection performances.

Table 3. Comparison of segmentation AP.